DDPG (Deep Deterministic Policy Gradient) — from scratch in PyTorch#

DDPG is an off-policy actor–critic algorithm for continuous control.

This notebook implements DDPG at a low level in PyTorch:

  • replay buffer

  • actor + critic networks

  • target networks with soft updates

  • exploration noise for deterministic policies

  • a clean training loop + Plotly diagnostics

We’ll train on a simple continuous environment (default: Pendulum-v1) and plot:

  • score per episode (learning curve)

  • Q-values / TD targets during learning

  • policy evolution on fixed probe states

Learning goals#

By the end you should be able to:

  • explain the actor–critic factorization in DDPG and what each network learns

  • write the critic target with target networks precisely

  • understand why DDPG needs (1) experience replay and (2) target networks

  • implement DDPG updates in low-level PyTorch (no RL libraries)

  • interpret common diagnostics: returns, losses, Q-values, and policy drift

Prerequisites#

  • basic PyTorch (nn.Module, optimizers)

  • Bellman equation / TD learning intuition

  • continuous action spaces (Box)

1) DDPG structure (actor–critic) and target networks#

Actor (deterministic policy)#

The actor is a deterministic policy network:

\[a = \mu_\theta(s)\]

In practice we output a tanh-bounded action and then scale to match the environment’s action bounds.
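Concretely, if the last layer outputs \(u\in(-1,1)\) via tanh, we affinely map it to the environment's Box bounds. A minimal sketch (the bounds here are illustrative):

```python
import numpy as np

def scale_action(u, low, high):
    """Map a tanh output u in (-1, 1) to the Box bounds [low, high]."""
    scale = (high - low) / 2.0
    bias = (high + low) / 2.0
    return scale * u + bias

print(scale_action(0.0, low=-2.0, high=2.0))   # 0.0 (center of the range)
print(scale_action(1.0, low=-2.0, high=2.0))   # 2.0 (upper bound)
print(scale_action(-1.0, low=-2.0, high=2.0))  # -2.0 (lower bound)
```

For symmetric bounds the bias is zero, so the map is a pure rescaling; for asymmetric bounds (e.g. a [0, 1] action) the bias shifts the center as well.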

Critic (action-value function)#

The critic estimates the Q-value for a state–action pair:

\[Q_\phi(s,a) \approx Q^{\mu}(s,a)\]

Target networks (the stabilizer)#

Because the TD target is bootstrapped, it depends on the very networks being trained, so it shifts with every gradient step. To reduce this moving-target instability, DDPG maintains slowly updated copies:

  • target actor: \(\mu_{\theta'}\)

  • target critic: \(Q_{\phi'}\)

Soft-update them after each gradient step:

\[\theta' \leftarrow \tau\,\theta + (1-\tau)\,\theta'\]
\[\phi' \leftarrow \tau\,\phi + (1-\tau)\,\phi'\]

Critic target (precise)#

For a transition \((s,a,r,s',d)\) sampled from replay (where \(d\in\{0,1\}\) indicates terminal), the TD target is

\[y = r + \gamma(1-d)\,Q_{\phi'}\big(s',\mu_{\theta'}(s')\big)\]

and we fit the critic via

\[\mathcal{L}(\phi) = \mathbb{E}\big[(Q_\phi(s,a)-y)^2\big].\]
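A quick numeric check of the target with made-up values: with \(r=-1\), \(\gamma=0.99\), and \(Q_{\phi'}(s',\mu_{\theta'}(s'))=-50\), a non-terminal transition gives \(y=-1+0.99\cdot(-50)=-50.5\), while a terminal one gives \(y=-1\).

```python
def td_target(r, gamma, done, q_next):
    # y = r + gamma * (1 - d) * Q_target(s', mu_target(s'))
    return r + gamma * (1.0 - done) * q_next

print(td_target(-1.0, 0.99, done=0.0, q_next=-50.0))  # -50.5
print(td_target(-1.0, 0.99, done=1.0, q_next=-50.0))  # -1.0
```

The `(1 - d)` factor is what stops bootstrapping through terminal states.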

Actor objective (deterministic policy gradient)#

The actor is trained to maximize the critic’s value under its actions:

\[J(\theta) = \mathbb{E}_{s\sim\mathcal{D}}\big[Q_\phi(s,\mu_\theta(s))\big].\]

In code we minimize the actor loss

\[\mathcal{L}_{actor}(\theta) = -\mathbb{E}\big[Q_\phi(s,\mu_\theta(s))\big].\]

The gradient is the deterministic policy gradient:

\[\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_a Q_\phi(s,a)\rvert_{a=\mu_\theta(s)}\,\nabla_\theta \mu_\theta(s)\right].\]

PyTorch computes this automatically when we backprop through Q(s, actor(s)).
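A tiny sanity check of that claim (toy linear networks, not the ones used below): backprop through critic(s, actor(s)) populates gradients for the actor's parameters, and also for the critic's, which is why the update later in this notebook only steps the actor optimizer for this loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
actor = nn.Linear(3, 1)       # stand-in for mu_theta
critic = nn.Linear(3 + 1, 1)  # stand-in for Q_phi(s, a)

s = torch.randn(8, 3)
loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()  # actor loss: -Q(s, mu(s))
loss.backward()

# Gradients reach both networks through the chain rule; only the actor is stepped.
print(actor.weight.grad is not None)   # True
print(critic.weight.grad is not None)  # True
```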

2) Algorithm sketch (pseudocode)#

  1. Initialize actor \(\mu_\theta\), critic \(Q_\phi\)

  2. Initialize target networks \(\mu_{\theta'}\leftarrow\mu_\theta\), \(Q_{\phi'}\leftarrow Q_\phi\)

  3. Initialize replay buffer \(\mathcal{D}\)

  4. For each environment step:

    • act with exploration: \(a=\mu_\theta(s)+\epsilon\), clipped to the action bounds

    • store \((s,a,r,s',d)\) in \(\mathcal{D}\)

    • sample minibatch from \(\mathcal{D}\)

    • critic: regress \(Q_\phi(s,a)\) to \(y=r+\gamma(1-d)Q_{\phi'}(s',\mu_{\theta'}(s'))\)

    • actor: ascend \(\nabla_\theta Q_\phi(s,\mu_\theta(s))\)

    • soft update targets: \((\theta',\phi')\leftarrow \tau(\theta,\phi)+(1-\tau)(\theta',\phi')\)

import math
import os
import platform
import time
from dataclasses import dataclass

import numpy as np
import pandas as pd
import plotly
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

try:
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    TORCH_AVAILABLE = True
except Exception as e:
    TORCH_AVAILABLE = False
    _TORCH_IMPORT_ERROR = e

# Gymnasium first; fall back to gym
try:
    import gymnasium as gym
    GYM_BACKEND = 'gymnasium'
except Exception:
    import gym
    GYM_BACKEND = 'gym'

pio.templates.default = 'plotly_white'
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

print('Python', platform.python_version())
print('NumPy', np.__version__)
print('Pandas', pd.__version__)
print('Plotly', plotly.__version__)
print('Gym backend', GYM_BACKEND, 'version', gym.__version__)
print('Torch', torch.__version__ if TORCH_AVAILABLE else _TORCH_IMPORT_ERROR)
Python 3.12.9
NumPy 1.26.2
Pandas 2.1.3
Plotly 6.5.2
Gym backend gymnasium version 1.1.1
Torch 2.7.0+cu126
# --- Run configuration ---
FAST_RUN = True  # set False for longer training

ENV_ID = 'Pendulum-v1'
SEED = 42

NUM_EPISODES = 40 if FAST_RUN else 250
MAX_STEPS_PER_EPISODE = None  # None means use env default

REPLAY_SIZE = 200_000
BATCH_SIZE = 128
GAMMA = 0.99
TAU = 0.005

ACTOR_LR = 1e-3
CRITIC_LR = 1e-3

START_STEPS = 2_000  # random actions before using the actor + noise
UPDATE_AFTER = 1_000  # start gradient updates after this many steps
UPDATES_PER_STEP = 1

NOISE_SIGMA = 0.1  # exploration noise std (in action units after scaling)

HIDDEN_SIZES = (256, 256)
GRAD_CLIP_NORM = 1.0

PROBE_N = 32
PROBE_EVERY_EPISODES = 5

DEVICE = 'cuda' if TORCH_AVAILABLE and torch.cuda.is_available() else 'cpu'
print('DEVICE:', DEVICE)
DEVICE: cpu
def set_global_seeds(seed: int):
    np.random.seed(seed)
    if TORCH_AVAILABLE:
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)


def env_reset(env, seed: int | None = None):
    out = env.reset(seed=seed) if seed is not None else env.reset()
    if isinstance(out, tuple):
        obs, info = out
    else:
        obs, info = out, {}
    return obs, info


def env_step(env, action):
    out = env.step(action)
    if len(out) == 5:
        next_obs, reward, terminated, truncated, info = out
        done = bool(terminated or truncated)
    else:
        next_obs, reward, done, info = out
        done = bool(done)
    return next_obs, float(reward), done, info


def make_env(env_id: str, seed: int):
    env = gym.make(env_id)
    _ = env_reset(env, seed=seed)
    try:
        env.action_space.seed(seed)
        env.observation_space.seed(seed)
    except Exception:
        pass
    return env


def action_scale_and_bias(action_space):
    # Works for gymnasium.spaces.Box and gym.spaces.Box
    high = np.asarray(action_space.high, dtype=np.float32)
    low = np.asarray(action_space.low, dtype=np.float32)
    scale = (high - low) / 2.0
    bias = (high + low) / 2.0
    return scale, bias
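For Pendulum-v1 the action space is Box(-2, 2, (1,)), so this helper yields scale 2 and bias 0. A quick check with a stand-in Box, so no environment is needed:

```python
import numpy as np
from types import SimpleNamespace

# Stand-in for a gym/gymnasium Box with Pendulum-v1's torque bounds
box = SimpleNamespace(low=np.array([-2.0], dtype=np.float32),
                      high=np.array([2.0], dtype=np.float32))

scale = (box.high - box.low) / 2.0
bias = (box.high + box.low) / 2.0
print(scale, bias)  # [2.] [0.]
```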
class ReplayBuffer:
    def __init__(self, obs_dim: int, act_dim: int, size: int, seed: int):
        self.obs_buf = np.zeros((size, obs_dim), dtype=np.float32)
        self.next_obs_buf = np.zeros((size, obs_dim), dtype=np.float32)
        self.act_buf = np.zeros((size, act_dim), dtype=np.float32)
        self.rew_buf = np.zeros((size, 1), dtype=np.float32)
        self.done_buf = np.zeros((size, 1), dtype=np.float32)

        self.max_size = int(size)
        self.ptr = 0
        self.size = 0
        self.rng = np.random.default_rng(seed)

    def add(self, obs, act, rew: float, next_obs, done: bool):
        self.obs_buf[self.ptr] = obs
        self.act_buf[self.ptr] = act
        self.rew_buf[self.ptr] = rew
        self.next_obs_buf[self.ptr] = next_obs
        self.done_buf[self.ptr] = float(done)

        self.ptr = (self.ptr + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    def sample(self, batch_size: int):
        idx = self.rng.integers(0, self.size, size=batch_size)
        batch = dict(
            obs=self.obs_buf[idx],
            act=self.act_buf[idx],
            rew=self.rew_buf[idx],
            next_obs=self.next_obs_buf[idx],
            done=self.done_buf[idx],
        )
        return batch
def mlp(sizes, activation=nn.ReLU, output_activation=nn.Identity):
    layers = []
    for i in range(len(sizes) - 1):
        act = activation if i < len(sizes) - 2 else output_activation
        layers += [nn.Linear(sizes[i], sizes[i + 1]), act()]
    return nn.Sequential(*layers)


class Actor(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden_sizes, action_scale, action_bias):
        super().__init__()
        self.net = mlp([obs_dim, *hidden_sizes, act_dim], activation=nn.ReLU, output_activation=nn.Tanh)
        self.register_buffer('action_scale', torch.as_tensor(action_scale, dtype=torch.float32))
        self.register_buffer('action_bias', torch.as_tensor(action_bias, dtype=torch.float32))

    def forward(self, obs):
        a = self.net(obs)
        return self.action_scale * a + self.action_bias


class Critic(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden_sizes):
        super().__init__()
        self.net = mlp([obs_dim + act_dim, *hidden_sizes, 1], activation=nn.ReLU, output_activation=nn.Identity)

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        return self.net(x)
@dataclass
class DDPGConfig:
    gamma: float = GAMMA
    tau: float = TAU
    actor_lr: float = ACTOR_LR
    critic_lr: float = CRITIC_LR
    batch_size: int = BATCH_SIZE
    grad_clip_norm: float | None = GRAD_CLIP_NORM


class DDPGAgent:
    def __init__(self, obs_dim: int, act_dim: int, action_scale, action_bias, hidden_sizes, device: str, cfg: DDPGConfig):
        self.device = torch.device(device)
        self.cfg = cfg

        self.actor = Actor(obs_dim, act_dim, hidden_sizes, action_scale, action_bias).to(self.device)
        self.critic = Critic(obs_dim, act_dim, hidden_sizes).to(self.device)

        # Target networks start as exact copies
        self.target_actor = Actor(obs_dim, act_dim, hidden_sizes, action_scale, action_bias).to(self.device)
        self.target_critic = Critic(obs_dim, act_dim, hidden_sizes).to(self.device)
        self.target_actor.load_state_dict(self.actor.state_dict())
        self.target_critic.load_state_dict(self.critic.state_dict())

        self.actor_opt = torch.optim.Adam(self.actor.parameters(), lr=cfg.actor_lr)
        self.critic_opt = torch.optim.Adam(self.critic.parameters(), lr=cfg.critic_lr)

    @torch.no_grad()
    def act(self, obs: np.ndarray, noise_sigma: float = 0.0):
        obs_t = torch.as_tensor(obs, dtype=torch.float32, device=self.device).unsqueeze(0)
        action = self.actor(obs_t).cpu().numpy().squeeze(0)
        if noise_sigma > 0:
            action = action + np.random.normal(0.0, noise_sigma, size=action.shape).astype(np.float32)
        return action

    def update(self, batch):
        obs = torch.as_tensor(batch['obs'], dtype=torch.float32, device=self.device)
        act = torch.as_tensor(batch['act'], dtype=torch.float32, device=self.device)
        rew = torch.as_tensor(batch['rew'], dtype=torch.float32, device=self.device)
        next_obs = torch.as_tensor(batch['next_obs'], dtype=torch.float32, device=self.device)
        done = torch.as_tensor(batch['done'], dtype=torch.float32, device=self.device)

        # --- Critic update ---
        with torch.no_grad():
            next_act = self.target_actor(next_obs)
            target_q_next = self.target_critic(next_obs, next_act)
            y = rew + self.cfg.gamma * (1.0 - done) * target_q_next

        q = self.critic(obs, act)
        critic_loss = F.mse_loss(q, y)

        self.critic_opt.zero_grad(set_to_none=True)
        critic_loss.backward()
        if self.cfg.grad_clip_norm is not None:
            torch.nn.utils.clip_grad_norm_(self.critic.parameters(), self.cfg.grad_clip_norm)
        self.critic_opt.step()

        # --- Actor update ---
        self.actor_opt.zero_grad(set_to_none=True)
        actor_actions = self.actor(obs)
        actor_loss = -self.critic(obs, actor_actions).mean()
        actor_loss.backward()
        if self.cfg.grad_clip_norm is not None:
            torch.nn.utils.clip_grad_norm_(self.actor.parameters(), self.cfg.grad_clip_norm)
        self.actor_opt.step()

        # --- Soft update target networks ---
        with torch.no_grad():
            for p, p_targ in zip(self.actor.parameters(), self.target_actor.parameters()):
                p_targ.data.mul_(1.0 - self.cfg.tau)
                p_targ.data.add_(self.cfg.tau * p.data)

            for p, p_targ in zip(self.critic.parameters(), self.target_critic.parameters()):
                p_targ.data.mul_(1.0 - self.cfg.tau)
                p_targ.data.add_(self.cfg.tau * p.data)

        metrics = {
            'critic_loss': float(critic_loss.item()),
            'actor_loss': float(actor_loss.item()),
            'q_mean': float(q.detach().mean().item()),
            'y_mean': float(y.detach().mean().item()),
        }
        return metrics
def moving_average(x, window: int):
    x = np.asarray(x, dtype=np.float64)
    if len(x) < window:
        return x
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode='valid')


def train_ddpg(env_id: str, seed: int):
    set_global_seeds(seed)
    env = make_env(env_id, seed=seed)

    obs_dim = int(np.prod(env.observation_space.shape))
    act_dim = int(np.prod(env.action_space.shape))

    act_scale, act_bias = action_scale_and_bias(env.action_space)

    buf = ReplayBuffer(obs_dim, act_dim, size=REPLAY_SIZE, seed=seed)
    agent = DDPGAgent(
        obs_dim=obs_dim,
        act_dim=act_dim,
        action_scale=act_scale,
        action_bias=act_bias,
        hidden_sizes=HIDDEN_SIZES,
        device=DEVICE,
        cfg=DDPGConfig(),
    )

    max_steps = MAX_STEPS_PER_EPISODE or getattr(env, '_max_episode_steps', 200)

    logs = {
        'episode': [],
        'episode_return': [],
        'episode_length': [],
        'global_step_end': [],
        # per-update metrics
        'update_step': [],
        'actor_loss': [],
        'critic_loss': [],
        'q_mean': [],
        'y_mean': [],
        # probe snapshots
        'probe_episode': [],
        'probe_action_stat': [],
        'probe_q': [],
    }

    probe_states = None

    global_step = 0
    update_step = 0

    t0 = time.time()
    for ep in range(1, NUM_EPISODES + 1):
        obs, _ = env_reset(env, seed=seed + ep)
        obs = np.asarray(obs, dtype=np.float32).reshape(-1)

        ep_return = 0.0
        ep_len = 0

        for _ in range(max_steps):
            if global_step < START_STEPS:
                action = env.action_space.sample()
            else:
                action = agent.act(obs, noise_sigma=NOISE_SIGMA)

            # clip to action bounds
            action = np.clip(action, env.action_space.low, env.action_space.high).astype(np.float32)

            next_obs, reward, done, _ = env_step(env, action)
            next_obs = np.asarray(next_obs, dtype=np.float32).reshape(-1)

            buf.add(obs, action, reward, next_obs, done)

            obs = next_obs
            ep_return += reward
            ep_len += 1
            global_step += 1

            # gradient updates
            if global_step >= UPDATE_AFTER and buf.size >= BATCH_SIZE:
                for _u in range(UPDATES_PER_STEP):
                    batch = buf.sample(BATCH_SIZE)
                    metrics = agent.update(batch)

                    logs['update_step'].append(update_step)
                    logs['actor_loss'].append(metrics['actor_loss'])
                    logs['critic_loss'].append(metrics['critic_loss'])
                    logs['q_mean'].append(metrics['q_mean'])
                    logs['y_mean'].append(metrics['y_mean'])
                    update_step += 1

            if done:
                break

        logs['episode'].append(ep)
        logs['episode_return'].append(ep_return)
        logs['episode_length'].append(ep_len)
        logs['global_step_end'].append(global_step)

        # Fix a set of probe states once replay has enough data
        if probe_states is None and buf.size >= max(PROBE_N, BATCH_SIZE):
            probe_states = buf.sample(PROBE_N)['obs']

        # Snapshot policy + Q on probe states to visualize policy evolution
        if probe_states is not None and (ep % PROBE_EVERY_EPISODES == 0 or ep == NUM_EPISODES):
            with torch.no_grad():
                ps = torch.as_tensor(probe_states, dtype=torch.float32, device=agent.device)
                pa = agent.actor(ps).cpu().numpy()
                pq = agent.critic(ps, agent.actor(ps)).cpu().numpy().reshape(-1)

            if pa.shape[1] == 1:
                policy_stat = pa[:, 0]  # 1D actions
            else:
                policy_stat = np.linalg.norm(pa, axis=1)  # multi-dim summary

            logs['probe_episode'].append(ep)
            logs['probe_action_stat'].append(policy_stat)
            logs['probe_q'].append(pq)

        if ep % 10 == 0 or ep == 1 or ep == NUM_EPISODES:
            elapsed = time.time() - t0
            print(f'Episode {ep:4d} | return {ep_return:8.1f} | len {ep_len:3d} | steps {global_step:6d} | elapsed {elapsed:6.1f}s')

    env.close()
    return logs


logs = train_ddpg(ENV_ID, seed=SEED)
print('Done. Episodes:', len(logs['episode']), 'Updates:', len(logs['update_step']))
Episode    1 | return   -865.7 | len 200 | steps    200 | elapsed    0.0s
Episode   10 | return   -836.3 | len 200 | steps   2000 | elapsed   11.1s
Episode   20 | return   -257.9 | len 200 | steps   4000 | elapsed   40.1s
Episode   30 | return   -127.4 | len 200 | steps   6000 | elapsed   71.2s
Episode   40 | return   -381.6 | len 200 | steps   8000 | elapsed  102.5s
Done. Episodes: 40 Updates: 7001

3) Plotly diagnostics#

DDPG can look like it’s learning while the critic is quietly diverging, so we’ll monitor:

  • episode return (score)

  • critic loss and actor loss

  • Q-values vs TD targets (sanity check)

  • policy evolution on fixed probe states (is the policy drifting smoothly?)

# --- Learning curve: score per episode ---
df_ep = pd.DataFrame({
    'episode': logs['episode'],
    'return': logs['episode_return'],
    'length': logs['episode_length'],
})

ma_window = 10
ma = moving_average(df_ep['return'].values, window=ma_window)
ma_x = df_ep['episode'].values[ma_window - 1:]

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_ep['episode'], y=df_ep['return'], mode='lines', name='Return'))
if len(ma) == len(ma_x):
    fig.add_trace(go.Scatter(x=ma_x, y=ma, mode='lines', name=f'Return (MA {ma_window})'))
fig.update_layout(title='DDPG learning curve (score per episode)', xaxis_title='Episode', yaxis_title='Return')
fig.show()
# --- Q-values, TD targets, and losses over update steps ---
df_up = pd.DataFrame({
    'update_step': logs['update_step'],
    'critic_loss': logs['critic_loss'],
    'actor_loss': logs['actor_loss'],
    'q_mean': logs['q_mean'],
    'y_mean': logs['y_mean'],
})

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_up['update_step'], y=df_up['q_mean'], mode='lines', name='Q(s,a) mean'))
fig.add_trace(go.Scatter(x=df_up['update_step'], y=df_up['y_mean'], mode='lines', name='TD target y mean'))
fig.update_layout(title='Critic outputs vs TD targets (mean over minibatch)', xaxis_title='Update step', yaxis_title='Value')
fig.show()

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_up['update_step'], y=df_up['critic_loss'], mode='lines', name='Critic loss (MSE)'))
fig.add_trace(go.Scatter(x=df_up['update_step'], y=df_up['actor_loss'], mode='lines', name='Actor loss (-Q)'))
fig.update_layout(title='Actor/Critic losses', xaxis_title='Update step', yaxis_title='Loss')
fig.show()
# --- Policy evolution on fixed probe states ---
if len(logs['probe_episode']) > 0:
    probe_eps = logs['probe_episode']
    z_action = np.stack(logs['probe_action_stat'], axis=1)  # (PROBE_N, T)
    z_q = np.stack(logs['probe_q'], axis=1)  # (PROBE_N, T)

    fig = go.Figure(data=go.Heatmap(
        z=z_action,
        x=probe_eps,
        y=list(range(z_action.shape[0])),
        colorscale='RdBu',
        zmid=0.0,
        colorbar=dict(title='action (1D) or ||a||'),
    ))
    fig.update_layout(title='Policy evolution on fixed probe states', xaxis_title='Episode snapshot', yaxis_title='Probe state index')
    fig.show()

    fig = go.Figure(data=go.Heatmap(
        z=z_q,
        x=probe_eps,
        y=list(range(z_q.shape[0])),
        colorscale='Viridis',
        colorbar=dict(title='Q(s, mu(s))'),
    ))
    fig.update_layout(title='Q-values on probe states (critic under current actor)', xaxis_title='Episode snapshot', yaxis_title='Probe state index')
    fig.show()
else:
    print('No probe snapshots recorded (try increasing NUM_EPISODES or reducing PROBE_EVERY_EPISODES).')

4) Stable-Baselines implementation (if you want a reference)#

If you want a battle-tested baseline, Stable-Baselines has DDPG implementations.

Notes:

  • stable-baselines3 (PyTorch) and stable-baselines (TensorFlow) are different packages.

  • This repository’s environment may not have them installed; the code below is for reference.

Stable-Baselines3 (PyTorch)#

# pip install stable-baselines3
import numpy as np
import gymnasium as gym

from stable_baselines3 import DDPG
from stable_baselines3.common.noise import NormalActionNoise

env = gym.make('Pendulum-v1')

n_actions = env.action_space.shape[0]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = DDPG('MlpPolicy', env, action_noise=action_noise, verbose=1)
model.learn(total_timesteps=200_000)

Stable-Baselines (TensorFlow; older/archived)#

# pip install stable-baselines
import numpy as np
import gym

from stable_baselines import DDPG
from stable_baselines.ddpg.policies import MlpPolicy
from stable_baselines.common.noise import NormalActionNoise

env = gym.make('Pendulum-v1')

n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = DDPG(MlpPolicy, env, action_noise=action_noise, verbose=1)
model.learn(total_timesteps=200_000)

5) Pitfalls + diagnostics#

  • Exploration: deterministic policies need explicit noise; too little noise → no learning.

  • Q-value blow-up: if \(Q\) grows without bound, reduce learning rates, add gradient clipping, check reward scale.

  • Action scaling: always scale tanh outputs to environment bounds; otherwise the critic learns on invalid actions.

  • Replay warm-up: start updates only after enough diverse transitions exist.

  • Overestimation bias: DDPG can overestimate; TD3 addresses this with twin critics + target smoothing.
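For reference, the TD3-style clipped double-Q target mentioned above replaces the single target critic with the minimum of two, which counteracts overestimation. A sketch with made-up values:

```python
r, gamma, done = -1.0, 0.99, 0.0
q1_next, q2_next = -48.0, -52.0  # two target critics evaluated on (s', a')

# DDPG bootstraps from a single critic; TD3 takes the pessimistic minimum
y_td3 = r + gamma * (1.0 - done) * min(q1_next, q2_next)
print(y_td3)  # -52.48
```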

6) Hyperparameters explained (the ones that matter)#

GAMMA (\(\gamma\))#

Discount factor in the TD target:

\[y=r+\gamma(1-d)Q_{\phi'}(s',\mu_{\theta'}(s')).\]
  • closer to 1 → longer-horizon credit assignment, but bootstrapping is harder

  • smaller → more myopic, often more stable
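A useful rule of thumb: the effective planning horizon is roughly \(1/(1-\gamma)\) steps, since that is the total weight of the discounted reward sum and rewards beyond it contribute little.

```python
# Effective horizon ~ 1/(1 - gamma) steps
for gamma in (0.9, 0.99, 0.999):
    print(f'gamma={gamma}: horizon ~ {1.0 / (1.0 - gamma):.0f} steps')
```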

TAU (\(\tau\))#

Soft-update rate for target networks:

\[\theta'\leftarrow\tau\theta+(1-\tau)\theta'.\]
  • smaller (e.g. 0.001) → targets change slowly (stable, but may learn slower)

  • larger (e.g. 0.02) → targets track faster (less bias, potentially less stable)
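The soft update is an exponential moving average: after \(k\) updates the weight remaining on the original target parameters is \((1-\tau)^k\). With \(\tau=0.005\), the "memory" of the old target halves roughly every 138 updates:

```python
import math

tau = 0.005
half_life = math.log(0.5) / math.log(1.0 - tau)  # k such that (1 - tau)^k = 0.5
print(round(half_life))  # ~138 updates
```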

REPLAY_SIZE#

Maximum transitions stored.

  • too small → poor diversity, correlated samples

  • very large → more diversity but older data (off-policy mismatch) and more memory

BATCH_SIZE#

Minibatch size for gradient updates.

  • larger → smoother gradients, higher compute

  • smaller → noisier updates (can help exploration but can destabilize critic)

START_STEPS#

How long to act randomly before relying on the actor.

  • helps fill replay with diverse transitions

  • if too short, early actor updates overfit to narrow experience

UPDATE_AFTER#

Delay before starting gradient updates.

  • ensures the critic’s first targets aren’t based on tiny replay buffers

UPDATES_PER_STEP#

How many gradient updates to do per environment step.

  • 1 is the standard simple choice

  • larger values increase sample reuse but can overfit to replay and amplify instability

NOISE_SIGMA#

Exploration noise standard deviation (added to the actor’s action).

  • too small → agent may not discover better actions

  • too large → behavior becomes too random; critic targets get noisy

HIDDEN_SIZES#

Network capacity for actor/critic.

  • bigger networks can fit complex Q-functions but may be harder to train

GRAD_CLIP_NORM#

Gradient norm clipping (optional).

  • helps prevent occasional exploding gradients in the critic

7) Exercises + references#

Exercises#

  1. Replace Gaussian exploration with Ornstein–Uhlenbeck noise and compare learning.

  2. Add LayerNorm to the actor/critic MLPs; does it stabilize training?

  3. Implement TD3 changes (twin critics + target policy smoothing) and compare the Q-value diagnostics.
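For Exercise 1, a minimal Ornstein–Uhlenbeck process as a starting point. The parameters \(\theta=0.15\), \(\sigma=0.2\) follow the DDPG paper's defaults; the `dt` and seeding choices here are illustrative:

```python
import numpy as np

class OUNoise:
    """Temporally correlated noise: dx = theta*(mu - x)*dt + sigma*sqrt(dt)*N(0, I)."""
    def __init__(self, act_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.mu = mu * np.ones(act_dim, dtype=np.float32)
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        # Reset at episode boundaries so correlations don't leak across episodes
        self.x = self.mu.copy()

    def sample(self):
        self.x = self.x + self.theta * (self.mu - self.x) * self.dt \
                 + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape)
        return self.x

noise = OUNoise(act_dim=1)
print(noise.sample().shape)  # (1,)
```

Unlike independent Gaussian noise, consecutive samples are correlated and mean-revert toward `mu`, which can produce more coherent exploration in physical-control tasks.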

References#

  • Lillicrap et al., Continuous control with deep reinforcement learning (DDPG): https://arxiv.org/abs/1509.02971

  • OpenAI Spinning Up (DDPG explanation + tips): https://spinningup.openai.com/en/latest/algorithms/ddpg.html

  • Stable-Baselines (archived TF implementations): https://github.com/hill-a/stable-baselines

  • Stable-Baselines3 docs: https://stable-baselines3.readthedocs.io/